Section I: Introduction

1.1: Motivation and Previous Research

[text goes here]

1.2: SMART Question

[text goes here]

1.3: Description of the Dataset

[text goes here]

1.3.1: Initial Dataset Cleaning

First, we read in the data and prepare it for analysis. The data is mostly clean, but we need to (1) create a numeric subset for calculating correlation, (2) convert some variables to categorical and others to numeric, and (3) parse the dates so that they are not read in as character strings.

Without doing anything, our dataset is as follows:

# read in dataset
airbnb <- read.csv("NYC ABS.csv", header = TRUE) # read.csv() already returns a data frame

#structure of dataset
str(airbnb)
## 'data.frame':    48895 obs. of  16 variables:
##  $ id                            : int  2539 2595 3647 3831 5022 5099 5121 5178 5203 5238 ...
##  $ name                          : chr  "Clean & quiet apt home by the park" "Skylit Midtown Castle" "THE VILLAGE OF HARLEM....NEW YORK !" "Cozy Entire Floor of Brownstone" ...
##  $ host_id                       : int  2787 2845 4632 4869 7192 7322 7356 8967 7490 7549 ...
##  $ host_name                     : chr  "John" "Jennifer" "Elisabeth" "LisaRoxanne" ...
##  $ neighbourhood_group           : chr  "Brooklyn" "Manhattan" "Manhattan" "Brooklyn" ...
##  $ neighbourhood                 : chr  "Kensington" "Midtown" "Harlem" "Clinton Hill" ...
##  $ latitude                      : num  40.6 40.8 40.8 40.7 40.8 ...
##  $ longitude                     : num  -74 -74 -73.9 -74 -73.9 ...
##  $ room_type                     : chr  "Private room" "Entire home/apt" "Private room" "Entire home/apt" ...
##  $ price                         : int  149 225 150 89 80 200 60 79 79 150 ...
##  $ minimum_nights                : int  1 1 3 1 10 3 45 2 2 1 ...
##  $ number_of_reviews             : int  9 45 0 270 9 74 49 430 118 160 ...
##  $ last_review                   : chr  "10/19/2018" "5/21/2019" "" "07-05-2019" ...
##  $ reviews_per_month             : num  0.21 0.38 NA 4.64 0.1 0.59 0.4 3.47 0.99 1.33 ...
##  $ calculated_host_listings_count: int  6 2 1 1 1 1 1 1 1 4 ...
##  $ availability_365              : int  365 355 365 194 0 129 0 220 0 188 ...
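
The cleaning code itself is not shown in this report. A minimal sketch of the conversions described above, run on a few toy rows rather than the full file, might look like the following. The helper `parse_mixed_date` and the names `raw`, `clean`, and `clean_cor` are hypothetical, and reading the dash-separated dates as month-day-year is an assumption.

```r
# Toy rows mimicking some of the raw columns shown above (illustrative only)
raw <- data.frame(
  neighbourhood_group = c("Brooklyn", "Manhattan", "Manhattan", "Brooklyn"),
  neighbourhood       = c("Kensington", "Midtown", "Harlem", "Clinton Hill"),
  room_type           = c("Private room", "Entire home/apt",
                          "Private room", "Entire home/apt"),
  price               = c(149L, 225L, 150L, 89L),
  minimum_nights      = c(1L, 1L, 3L, 1L),
  number_of_reviews   = c(9L, 45L, 0L, 270L),
  last_review         = c("10/19/2018", "5/21/2019", "", "07-05-2019"),
  reviews_per_month   = c(0.21, 0.38, NA, 4.64),
  calculated_host_listings_count = c(6L, 2L, 1L, 1L),
  availability_365    = c(365L, 355L, 365L, 194L),
  stringsAsFactors    = FALSE
)

# Hypothetical helper: parse dates written as m/d/Y or m-d-Y
# (the field order of the dash form is an assumption); blanks become NA
parse_mixed_date <- function(x) {
  slash    <- as.Date(x, format = "%m/%d/%Y")
  dash     <- as.Date(x, format = "%m-%d-%Y")
  unparsed <- is.na(slash)
  slash[unparsed] <- dash[unparsed]
  slash
}

factor_cols  <- c("neighbourhood_group", "neighbourhood", "room_type")
numeric_cols <- c("price", "minimum_nights", "number_of_reviews",
                  "reviews_per_month", "calculated_host_listings_count",
                  "availability_365")

clean <- raw
clean[factor_cols]  <- lapply(clean[factor_cols], factor)      # categorical
clean[numeric_cols] <- lapply(clean[numeric_cols], as.numeric) # int -> num
clean$last_review   <- parse_mixed_date(clean$last_review)     # chr -> Date

# Numeric-only subset used for the correlation analysis
clean_cor <- clean[numeric_cols]
str(clean)
```

Selecting columns by name rather than by position keeps the subset correct even if the column order of the file changes.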

After cleaning, our main dataset is described below:

str(airbnb)
## 'data.frame':    48895 obs. of  16 variables:
##  $ id                            : int  2539 2595 3647 3831 5022 5099 5121 5178 5203 5238 ...
##  $ name                          : chr  "Clean & quiet apt home by the park" "Skylit Midtown Castle" "THE VILLAGE OF HARLEM....NEW YORK !" "Cozy Entire Floor of Brownstone" ...
##  $ host_id                       : int  2787 2845 4632 4869 7192 7322 7356 8967 7490 7549 ...
##  $ host_name                     : chr  "John" "Jennifer" "Elisabeth" "LisaRoxanne" ...
##  $ neighbourhood_group           : Factor w/ 5 levels "Bronx","Brooklyn",..: 2 3 3 2 3 3 2 3 3 3 ...
##  $ neighbourhood                 : Factor w/ 221 levels "Allerton","Arden Heights",..: 109 128 95 42 62 138 14 96 203 36 ...
##  $ latitude                      : num  40.6 40.8 40.8 40.7 40.8 ...
##  $ longitude                     : num  -74 -74 -73.9 -74 -73.9 ...
##  $ room_type                     : Factor w/ 3 levels "Entire home/apt",..: 2 1 2 1 1 1 2 2 2 1 ...
##  $ price                         : num  149 225 150 89 80 200 60 79 79 150 ...
##  $ minimum_nights                : num  1 1 3 1 10 3 45 2 2 1 ...
##  $ number_of_reviews             : num  9 45 0 270 9 74 49 430 118 160 ...
##  $ last_review                   : Date, format: "2018-10-19" "2019-05-21" ...
##  $ reviews_per_month             : num  0.21 0.38 NA 4.64 0.1 0.59 0.4 3.47 0.99 1.33 ...
##  $ calculated_host_listings_count: num  6 2 1 1 1 1 1 1 1 4 ...
##  $ availability_365              : num  365 355 365 194 0 129 0 220 0 188 ...

Our secondary dataset (used to measure correlation) is described below:

str(airbnb_cor)
## 'data.frame':    48895 obs. of  6 variables:
##  $ price                         : num  149 225 150 89 80 200 60 79 79 150 ...
##  $ minimum_nights                : num  1 1 3 1 10 3 45 2 2 1 ...
##  $ number_of_reviews             : num  9 45 0 270 9 74 49 430 118 160 ...
##  $ reviews_per_month             : num  0.21 0.38 NA 4.64 0.1 0.59 0.4 3.47 0.99 1.33 ...
##  $ calculated_host_listings_count: num  6 2 1 1 1 1 1 1 1 4 ...
##  $ availability_365              : num  365 355 365 194 0 129 0 220 0 188 ...

1.3.2: Descriptive Statistics

Our dataset has 48895 observations. The continuous variables are described below (NA values omitted):

Table: Statistics summary.
Statistic  price  minimum_nights  number_of_reviews  reviews_per_month  calculated_host_listings_count  availability_365
Min        0      1               1                  0.0                1                               0
Q1         69     1               3                  0.2                1                               0
Median     101    2               9                  0.7                1                               55
Mean       142    6               29                 1.4                5                               115
Q3         170    4               33                 2.0                2                               229
Max        10000  1250            629                58.5               327                             365
library(DescTools)

The standard deviations of the continuous variables are as follows:

  • price: 240.154
  • minimum_nights: 20.511
  • number_of_reviews: 44.551
  • reviews_per_month: 1.68
  • calculated_host_listings_count: 32.953
  • availability_365: 131.622

The modes of the continuous variables are as follows:

  • price: 100
  • minimum_nights: 1
  • number_of_reviews: 0
  • reviews_per_month: 0.02
  • calculated_host_listings_count: 1
  • availability_365: 0
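
The code that produced these standard deviations and modes is not shown. A minimal sketch of one way to compute them in base R follows; `stat_mode` is a hypothetical helper written for illustration (the report loads DescTools, which provides a `Mode` function for the same purpose), and the vector is toy data.

```r
# Hypothetical helper: most frequent value(s) of a numeric vector
stat_mode <- function(x) {
  counts <- table(x)  # table() ignores NA by default
  as.numeric(names(counts)[counts == max(counts)])
}

# Toy values; the NA mimics a listing that has no reviews
reviews_per_month <- c(0.21, 0.38, NA, 4.64, 0.10, 0.21)

sd_rpm   <- sd(reviews_per_month, na.rm = TRUE)  # NA must be dropped explicitly
mode_rpm <- stat_mode(reviews_per_month)
```

Without `na.rm = TRUE`, `sd()` returns NA for any column that contains missing values, which here applies to reviews_per_month.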

The main categorical variables of interest are counted below:

#neighborhood group
xkabledply(table(airbnb$neighbourhood_group))
neighbourhood_group  Listings
Bronx                1091
Brooklyn             20104
Manhattan            21661
Queens               5666
Staten Island        373
#room_type
xkabledply(table(airbnb$room_type))
room_type        Listings
Entire home/apt  25409
Private room     22326
Shared room      1160

1.4: Overview of Tests and Analysis Conducted

[text goes here]

Section II: Exploratory Data Analysis

2.1: Normality Plots

2.1.1: Histogram for Price (All Observations)

First, we look at the price distribution for the entire dataset:

library(ggplot2)
ggplot(data=airbnb, aes(price)) + 
  geom_histogram(bins = 100, 
                 col  = "dark blue", 
                 fill = "light blue", 
                alpha = .7) + # opacity
    labs(title = "Histogram of AirBnB Price (All Observations)", 
             x = "AirBnB Price", 
             y = "Frequency") +
     theme_grey()

2.1.2: Q-Q plot for Price (All Observations)

qqnorm(airbnb$price, pch = 20, main = "Q-Q Plot for AirBnB Prices (All Observations)")
qqline(airbnb$price, col = "black", lwd = 2)

Price is not normally distributed, and there appear to be a large number of outliers. If we remove those outliers, the distribution is close to normal.

airbnb_clean = outlierKD2(airbnb, price, rm = TRUE, boxplt = TRUE, qqplt = TRUE)

## Outliers identified: 2972 
## Propotion (%) of outliers: 6.5 
## Mean of the outliers: 659 
## Mean without removing outliers: 153 
## Mean if we remove outliers: 120 
## Outliers successfully removed
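
The function outlierKD2 is an external helper whose source is not included here. Assuming it applies the usual boxplot rule (values more than 1.5 times the interquartile range beyond the quartiles are treated as outliers and set to NA), the idea can be sketched in base R as below. The name `remove_iqr_outliers` and the toy prices are hypothetical, and the exact quartile definition used by outlierKD2 may differ slightly.

```r
# Hypothetical helper: replace values outside the 1.5 * IQR fences with NA
remove_iqr_outliers <- function(x, k = 1.5) {
  q     <- unname(quantile(x, c(0.25, 0.75), na.rm = TRUE))
  fence <- k * (q[2] - q[1])
  x[x < q[1] - fence | x > q[2] + fence] <- NA
  x
}

prices  <- c(60, 79, 80, 89, 149, 150, 200, 225, 10000)
cleaned <- remove_iqr_outliers(prices)

mean(prices)                 # pulled upward by the 10000 listing
mean(cleaned, na.rm = TRUE)  # mean of the remaining listings
```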

2.1.3: Histogram for Price (Outliers Removed)

library(ggplot2)
ggplot(data = airbnb_clean, aes(price)) +  
  geom_histogram(bins = 100, 
                 col  = "dark blue", 
                 fill = "light blue", 
                alpha = .7) +  # opacity
    labs(title = "Histogram of AirBnB Price (Outliers removed)", 
             x = "AirBnB Price", 
             y = "Frequency") +
     theme_grey()

2.1.4: Q-Q plot for Price (Outliers Removed)

qqnorm(airbnb_clean$price, pch = 20, main = "Q-Q Plot for AirBnB Prices (Outliers Removed)")
qqline(airbnb_clean$price, col = "black", lwd = 2)

2.2: Scatter and Box Plots

2.2.1: Scatter Plot for Price and Number of Reviews

Scatter plot for price and number of reviews without any transformations:

library(ggplot2)
library(ggpubr)
ggplot(airbnb, aes(x=price, y=number_of_reviews)) + 
  ggtitle("Number of Reviews vs Price Scatter Plot") + 
  xlab("Price ($)") + ylab("Number Of Reviews") + 
  geom_point(size = 1, shape = 18, color = "black") + 
  theme_bw()

There appears to be an exponential relationship (exponential decline) between price and number of reviews.

Scatter plot for price and number of reviews, taking \(\log(\text{reviews})\) to show a linear trend:

library(ggplot2)
library(ggpubr)
ggplot(airbnb, aes(x=price, y=log(number_of_reviews))) + 
  ggtitle("Number of Reviews vs Price Scatter Plot") + 
  xlab("Price ($)") + ylab("Number Of Reviews (Natural Log)") + 
  geom_point(size = 1, shape = 18, color = "black")
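
One caveat with the log transform: the most common value of number_of_reviews is 0 (see 1.3.2), and \(\log(0)\) is negative infinity, so those listings have no finite position on the log axis. A small sketch of the issue and of one possible alternative, `log1p`, which computes \(\log(1 + x)\); the vector is toy data and the choice of `log1p` is a suggestion, not what the plot above uses.

```r
reviews <- c(9, 45, 0, 270)

log_reviews   <- log(reviews)    # the zero-review listing becomes -Inf
log1p_reviews <- log1p(reviews)  # log(1 + x); the zero-review listing becomes 0
```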

2.2.2: Box Plot for Price and Neighborhood Group

Note: outliers extend past $1,000 per night, graph truncated for visibility and interpretation.

library(ggplot2)
ggplot(airbnb, aes(price, factor(neighbourhood_group))) + 
  geom_boxplot(color = "black", fill = c("light green", "pink","light blue", "yellow", "orange")) +
  labs(title = "Neighbourhood Group vs Price Box plot", x = "Price", y = "Neighbourhood group") +
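  coord_cartesian(xlim = c(0, 1000)) # zoom only; xlim() would drop listings above $1,000 before the box statistics are computed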

2.2.3: Box Plot for Price and Room Type

Note: outliers extend past $1,000 per night, graph truncated for visibility and interpretation.

library(ggplot2)
ggplot(airbnb, aes(price, factor(room_type))) + 
  geom_boxplot(width = 0.7, color = "black", fill = c("light green", "yellow","light blue")) +
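  labs(title = "Room type vs Price Box plot", x = "Price", y = "Room Type") +
  coord_cartesian(xlim = c(0, 1000)) # zoom only; xlim() would drop listings above $1,000 before the box statistics are computed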

2.3: Maps

2.3.1: Map of AirBnbs in NY by Neighbourhood group

#install.packages("plotly")
library(plotly)

fig <- airbnb

fig <- fig %>%
  plot_ly(
    lat = ~latitude,
    lon = ~longitude,
    color = ~neighbourhood_group,
    colors = "Set1",
    type = 'scattermapbox')

fig <- fig %>%
  layout(
    mapbox = list(
      style = 'open-street-map',
      zoom = 9,
      center = list(lon = -73.97, lat = 40.71)))

fig

2.3.2: AirBnB Average Price and Number of Listings by Neighborhood

# create subset just for aggregating by mean
airbnb_map <- airbnb[ , c(6, 7, 8, 10)]
airbnb_map_means <- aggregate(.~neighbourhood, airbnb_map, mean)

# create subset for aggregating by count
airbnb_count <- airbnb_map
airbnb_count$count <- 1
airbnb_count <- airbnb_count[, c(1,5)]
airbnb_counter <- aggregate(.~neighbourhood, airbnb_count, sum)

# create full dataset from both subsets
airbnb_map_full <- cbind(airbnb_counter, airbnb_map_means)

# check that union occurred correctly, then drop extra neighborhood value
all.equal(airbnb_map_full[, 1], airbnb_map_full[, 3]) # true!
airbnb_map_full <- airbnb_map_full[, -3]
#Load the library
library(ggplot2)
library(ggmap)

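#Set your API key (read from an environment variable instead of hard-coding it; the variable name GOOGLE_MAPS_API_KEY is an assumption)
ggmap::register_google(key = Sys.getenv("GOOGLE_MAPS_API_KEY"))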

#map by price 
newyork.map <- get_map("New York", zoom = 10, scale = 1, maptype = "terrain")  
ggmap(newyork.map) +  geom_point(data = airbnb_map_full, aes(x = longitude, y = latitude, colour = price, size = count), alpha = 0.5) + 
  scale_colour_gradientn(colours=rainbow(3)) +
  labs(title = "Average AirBnb Price and Density by Neighborhood", x = "Longitude", y = "Latitude")
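
The cbind step above relies on both aggregates listing the neighborhoods in the same order, which is why the all.equal check is needed. A minimal alternative sketch, on toy data, joins the two aggregates by key with `merge` so that the row order no longer matters. The names `toy`, `toy_means`, `toy_counts`, and `toy_full` are hypothetical.

```r
# Toy listings (illustrative values only)
toy <- data.frame(
  neighbourhood = c("Harlem", "Harlem", "Midtown"),
  latitude      = c(40.80, 40.82, 40.75),
  longitude     = c(-73.95, -73.94, -73.98),
  price         = c(100, 200, 300)
)

# Mean latitude, longitude, and price per neighborhood
toy_means <- aggregate(. ~ neighbourhood, toy, mean)

# Number of listings per neighborhood
toy_counts <- aggregate(price ~ neighbourhood, toy, length)
names(toy_counts)[2] <- "count"

# Join on the neighborhood name instead of binding columns side by side
toy_full <- merge(toy_counts, toy_means, by = "neighbourhood")
toy_full
```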

Section III: Correlation and ANOVA Tests

3.1: Correlation

3.1.1: Correlation Matrix for Airbnb Data

To answer our SMART question, we needed to investigate if any variables other than location or room type seemed to be related to the variable of interest, price. The other variables identified are as follows:

  • minimum_nights
  • number_of_reviews
  • reviews_per_month
  • calculated_host_listings_count
  • availability_365

As all of these variables are continuous, we have chosen to look at correlation. It is important to note that correlation does not imply causation. However, for the purposes of our SMART question, if for example a variable \(x\) is highly and positively correlated with price, we can say that the presence of \(x\) is associated with an increase in price and that will be sufficient.

Below we have a correlation matrix and accompanying correlation plot using all complete observations (note: reviews per month contains NA values).

loadPkg("faraway")
loadPkg("corrplot")
xkabledply(cor(airbnb_cor, use = "complete.obs"))
Table
                                price    minimum_nights  number_of_reviews  reviews_per_month  calculated_host_listings_count  availability_365
price                           1.0000   0.0255          -0.0359            -0.0306            0.0529                          0.0782
minimum_nights                  0.0255   1.0000          -0.0694            -0.1217            0.0735                          0.1017
number_of_reviews               -0.0359  -0.0694         1.0000             0.5499             -0.0598                         0.1936
reviews_per_month               -0.0306  -0.1217         0.5499             1.0000             -0.0094                         0.1858
calculated_host_listings_count  0.0529   0.0735          -0.0598            -0.0094            1.0000                          0.1829
availability_365                0.0782   0.1017          0.1936             0.1858             0.1829                          1.0000
airbnb_corplot = cor(airbnb_cor, use = "complete.obs")
corrplot(airbnb_corplot, method = "circle")

As shown, no variable of interest shows a strong Pearson correlation with price (all coefficients are below 0.08 in absolute value). Among the other variables, availability_365 is weakly positively correlated with minimum_nights (0.10), number_of_reviews (0.19), and calculated_host_listings_count (0.18).

Additionally, as expected, we see a moderately strong correlation (0.55) between the total number of reviews and the reviews per month. This makes sense; we would expect that the more reviews in total a unit has, the more reviews it receives per month. This correlation is statistically significant, as shown in subsection 3.1.2.

While the correlation between price and number_of_reviews is not high, our previous scatter plots suggest an exponential relationship between the two variables, with the highest-priced units receiving very few reviews in total. In subsection 3.1.3, we see that this relationship is not strong but that there is some statistically significant evidence of correlation. This relationship makes sense; number of reviews can be a proxy for the total number of stays in a unit, and one would expect that expensive units are stayed in less frequently.
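
Note that the matrix above is computed with use = "complete.obs", which drops every listing that has an NA in any column. Its coefficient for price and number_of_reviews (-0.0359) is therefore based on fewer listings than the cor.test in 3.1.3 (-0.048, using all 48895 rows). A toy sketch of this behaviour, with made-up values:

```r
toy <- data.frame(
  price             = c(100, 150, 200, 80, 500),
  number_of_reviews = c(40, 10, 5, 60, 0),
  reviews_per_month = c(2.0, 0.5, 0.3, 3.1, NA)  # NA: listing with no reviews
)

# Matrix approach: the last row is dropped for every pair of variables
r_complete <- cor(toy, use = "complete.obs")["price", "number_of_reviews"]

# Pairwise approach on the two columns only: all five rows are used
r_all_rows <- cor(toy$price, toy$number_of_reviews)

c(complete_obs = r_complete, all_rows = r_all_rows)
```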

3.1.2: Correlation Between Reviews per Month and Total Reviews

cor.test(x=airbnb_cor$reviews_per_month, y=airbnb_cor$number_of_reviews)
## 
##  Pearson's product-moment correlation
## 
## data:  airbnb_cor$reviews_per_month and airbnb_cor$number_of_reviews
## t = 130, df = 38841, p-value <2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.543 0.557
## sample estimates:
##  cor 
## 0.55

3.1.3: Correlation Between Number of Reviews (Y) and Price (X)

cor.test(y=airbnb$number_of_reviews, x=airbnb$price)
## 
##  Pearson's product-moment correlation
## 
## data:  airbnb$price and airbnb$number_of_reviews
## t = -11, df = 48893, p-value <2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.0568 -0.0391
## sample estimates:
##    cor 
## -0.048

3.2: ANOVA Tests

3.2.1: Testing for Differences in Price by Neighborhood Group

#anova test for price and neighborhood groups
anova_price_group = aov(price ~ neighbourhood_group, data=airbnb)
summary(anova_price_group) -> sum_anova_price_group
xkabledply(sum_anova_price_group, title = "ANOVA result summary for Neighborhood Groups")
ANOVA result summary for Neighborhood Groups
                     Df     Sum Sq    Mean Sq   F value  Pr(>F)
neighbourhood_group  4      7.96e+07  19897739  355      0
Residuals            48890  2.74e+09  56051     NA       NA
tukeyAoV_pg <- TukeyHSD(anova_price_group)
tukeyAoV_pg
##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = price ~ neighbourhood_group, data = airbnb)
## 
## $neighbourhood_group
##                           diff     lwr   upr p adj
## Brooklyn-Bronx           36.89   16.81  57.0 0.000
## Manhattan-Bronx         109.38   89.34 129.4 0.000
## Queens-Bronx             12.02   -9.33  33.4 0.539
## Staten Island-Bronx      27.32  -11.42  66.1 0.305
## Manhattan-Brooklyn       72.49   66.17  78.8 0.000
## Queens-Brooklyn         -24.87  -34.58 -15.2 0.000
## Staten Island-Brooklyn   -9.57  -43.32  24.2 0.938
## Queens-Manhattan        -97.36 -106.99 -87.7 0.000
## Staten Island-Manhattan -82.06 -115.79 -48.3 0.000
## Staten Island-Queens     15.29  -19.23  49.8 0.746

3.2.1.1: Tukey HSD Results

At \(\alpha = 0.05\), the Tukey HSD results show statistically significant price differences between Manhattan and every other borough, between Brooklyn and the Bronx, and between Queens and Brooklyn. The differences among Staten Island, Queens, and the Bronx are not statistically significant, nor is the difference between Staten Island and Brooklyn.

This aligns with our previous visual findings that Manhattan appeared to be the most expensive borough, with Brooklyn second. In terms of using location to determine price, we can conclude that certain boroughs are more useful predictors than others. For example, knowing that a unit is in Manhattan as opposed to the Bronx would be useful, but with no other information, a unit in Staten Island may be priced roughly the same as a unit in Queens. Still, given that at least some locations are priced differently from each other, we can conclude that yes, location does have an effect on price.
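
When many pairs are compared, the significant ones can be pulled out of the TukeyHSD result programmatically rather than read off the printout. A minimal sketch on simulated data follows; the groups A, B, and C and their means are invented for illustration and are not the Airbnb boroughs.

```r
set.seed(1)

# Simulated prices: groups A and B share a mean, group C is clearly higher
toy <- data.frame(
  group = factor(rep(c("A", "B", "C"), each = 30)),
  price = c(rnorm(30, mean = 100, sd = 10),
            rnorm(30, mean = 100, sd = 10),
            rnorm(30, mean = 150, sd = 10))
)

fit   <- aov(price ~ group, data = toy)
tukey <- TukeyHSD(fit)$group  # matrix with columns diff, lwr, upr, p adj

# Pairs whose adjusted p-value falls below alpha = 0.05
significant_pairs <- rownames(tukey)[tukey[, "p adj"] < 0.05]
significant_pairs
```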

3.2.2: Testing for Differences in Price by Room Type

#anova test for price and room type
anova_price_room = aov(price ~ room_type, data=airbnb)
summary(anova_price_room) -> sum_anova_price_room
xkabledply(sum_anova_price_room, title = "ANOVA result summary for Room Type")
ANOVA result summary for Room Type
           Df     Sum Sq    Mean Sq   F value  Pr(>F)
room_type  2      1.85e+08  92512441  1717     0
Residuals  48892  2.63e+09  53892     NA       NA
tukeyAoV_pr <- TukeyHSD(anova_price_room)
tukeyAoV_pr
##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = price ~ room_type, data = airbnb)
## 
## $room_type
##                                diff  lwr     upr p adj
## Private room-Entire home/apt -122.0 -127 -117.02 0.000
## Shared room-Entire home/apt  -141.7 -158 -125.33 0.000
## Shared room-Private room      -19.7  -36   -3.27 0.014

3.2.2.1: Tukey HSD Results

At \(\alpha = 0.05\), there are statistically significant price differences between every pair of room types. This lines up with our box plots as well, where an entire home or apartment was priced noticeably higher than either a private room or a shared room. While not as visually obvious, a private room is also priced statistically significantly higher than a shared room (adjusted p = 0.014). We can conclude that room type does likely have an effect on price.

Section IV: Conclusion

4.1: Analysis of SMART Question

4.2: Limitations of Dataset

4.3: Next Steps